Load Data

# packages used throughout: dplyr/tidyr for wrangling, ggpubr for arranging plots
library(dplyr)
library(tidyr)
library(ggpubr)

lfq <- read.csv('./data/nonimputed_lfq.csv')

Data Preprocessing

LFQ_KO <- lfq %>%
    select(contains('KO'))

LFQ_WT <- lfq %>%
    select(contains('WT'))

Data Mining

ggarrange(
    plothist(LFQ_KO, '- KO'),
    plothist(LFQ_WT, '- WT'),
    nrow=2,ncol=1
)
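plothist() is a helper defined elsewhere in this analysis; a plausible sketch, assuming it draws one log2-intensity histogram per sample with ggplot2 and appends its second argument to the title:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# hypothetical stand-in for the plothist() helper used above:
# one intensity histogram facet per sample column, title suffix appended
plothist <- function(df, suffix) {
    df %>%
        pivot_longer(everything(), names_to = 'Sample', values_to = 'Intensity') %>%
        ggplot(aes(x = Intensity)) +
        geom_histogram(bins = 50, na.rm = TRUE) +
        facet_wrap(~Sample) +
        ggtitle(paste('LFQ intensity', suffix))
}
```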

# skewness of nonimputed data
moments::skewness(lfq, na.rm = TRUE)
##     KO.22     KO.23     KO.24     WT.22     WT.23     WT.24 
## 0.6754195 0.6725489 0.5218363 0.7211065 0.6383786 0.7043256
# skewness of imputed data
lfq.imp <- read.csv('./data/LFQ_raw_totals_imp.csv')
moments::skewness(lfq.imp)
## KO_TOTALS_22 KO_TOTALS_23 KO_TOTALS_24 WT_TOTALS_22 WT_TOTALS_23 WT_TOTALS_24 
##    0.5761698    0.5868388    0.5206581    0.6285968    0.5813767    0.6579983

The imputation has little influence on the skewness of the data. 🕺 If anything, it improves things slightly: the skewness coefficients are consistently lower for the imputed data.
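For reference, moments::skewness computes the method-of-moments estimator m3 / m2^(3/2); an equivalent base R sketch:

```r
# sample skewness: third central moment over second central moment^(3/2)
skewness_manual <- function(x) {
    x <- x[!is.na(x)]
    m2 <- mean((x - mean(x))^2)
    m3 <- mean((x - mean(x))^3)
    m3 / m2^1.5
}
```

A symmetric sample gives 0; the positive values above reflect the right-skewed intensity distributions.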

protti library

library(protti)

Data Preprocessing

Thanks to Mateusz, we had data almost ready for analysis with protti, but some important columns were missing. So, based on Matis’ code, I transformed and selected features following the Input Preparation workflow from the protti documentation.

proteingroups <- readr::read_tsv("data/proteinGroups.txt", show_col_types = FALSE)

proteingroups <- proteingroups %>% filter(is.na(`Only identified by site`),
                         is.na(Reverse),
                         is.na(`Potential contaminant`))

to_protti <- proteingroups %>%
    select(`Protein IDs`, `Peptide sequences`, 
           contains('LFQ intensity') & contains('TOTALS') & (ends_with('22') | ends_with('23') | ends_with('24'))) %>%
    mutate(`Protein IDs` = paste0('prot_', row_number())) %>%
    pivot_longer(3:8, names_to = 'Sample', values_to = 'Intensity') %>%
    mutate(Sample = gsub('LFQ intensity ', '', Sample)) %>%
    mutate(Sample = gsub('_TOTALS_', '_', Sample)) %>%
    separate(col = Sample, into = c("celltype", "rep"), sep = "_", remove = F) %>%
    mutate(Condition = ifelse(celltype == 'KO', 'treated', 'control'),
           Intensity = ifelse(Intensity == 0, NA, log2(Intensity))) %>% 
    select(Sample, `Protein IDs`, `Peptide sequences`, Condition, Intensity) 


We can do most of the calculations from the protti library with data structured as above.

Normalization

The protti library gives us a function to normalise data, but it supports only the median method: the normalised value is the original intensity minus the run median plus the global median.

normalised_data <- to_protti %>%
    protti::normalise(sample = Sample,
              intensity_log2 = Intensity,
              method = 'median')
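The median method can be reproduced by hand. A base R sketch of the formula above (not protti's implementation; I take the global median as the median over all runs):

```r
# normalised = intensity - run median + global median, computed per sample
median_normalise <- function(intensity, sample) {
    run_median <- tapply(intensity, sample, median, na.rm = TRUE)
    global_median <- median(intensity, na.rm = TRUE)
    unname(intensity - run_median[as.character(sample)]) + global_median
}
```

After normalisation, every run shares the same median, which makes the per-sample distributions directly comparable.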

normalised_data %>%
    group_by(Sample) %>%
    summarise(skewness = moments::skewness(normalised_intensity_log2, na.rm=TRUE))
## # A tibble: 6 × 2
##   Sample skewness
##   <chr>     <dbl>
## 1 KO_22     0.675
## 2 KO_23     0.673
## 3 KO_24     0.522
## 4 WT_22     0.721
## 5 WT_23     0.638
## 6 WT_24     0.704

Imputation

The library gives us two methods for imputation:

  1. method = "ludovic": MNAR values are sampled from a normal distribution centred three log2 units below the lowest intensity recorded for the precursor/peptide, with a spread equal to the mean standard deviation for the precursor/peptide.
  2. method = "noise": MNAR values are sampled from a normal distribution centred on the mean noise for the precursor/peptide, with a spread equal to the mean standard deviation (from each condition) for the precursor/peptide.
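Method 1 can be illustrated on a single peptide. A toy sketch of the sampling rule described above (not protti's code); sd_est stands in for the mean per-peptide standard deviation:

```r
# MNAR draw: Normal(lowest observed log2 intensity - 3, mean per-peptide sd)
impute_ludovic_toy <- function(x, sd_est) {
    n_missing <- sum(is.na(x))
    x[is.na(x)] <- rnorm(n_missing,
                         mean = min(x, na.rm = TRUE) - 3,
                         sd = sd_est)
    x
}
```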

Before imputation, we have to assign missingness to the rows in our data. We can do that with the assign_missingness() fxn from the protti library. This fxn returns the input dataframe extended by:

  1. the comparison column contains the comparison name for the specific treatment/reference pair.
  2. the missingness column reports the type of missingness.

Types of missingness:

  • "complete": No missing values in any replicate of this reference/treatment pair for the specific grouping variable.
  • "MNAR": Missing not at random. All replicates of either the reference or treatment condition have missing values for the specific grouping variable.
  • "MAR": Missing at random. At least n-1 replicates have missing values for the reference/treatment pair for the specific grouping variable.
  • NA: The comparison is not complete enough to fall into any other category. It will not be imputed if imputation is performed.
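The categories above can be sketched for a single protein with one replicate vector per condition. A simplified toy (not protti's code, which additionally applies completeness cutoffs via its completeness arguments); I read the MAR rule as "some values missing, but each condition retains at least n-1 values":

```r
# toy per-protein missingness classifier following the categories above
classify_missingness_toy <- function(ref, treat) {
    if (!anyNA(c(ref, treat))) return('complete')
    if (all(is.na(ref)) || all(is.na(treat))) return('MNAR')
    if (sum(!is.na(ref)) >= length(ref) - 1 &&
        sum(!is.na(treat)) >= length(treat) - 1) return('MAR')
    NA_character_  # too incomplete to categorise; will not be imputed
}
```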

# Assign missingness to each row
data_missing <- normalised_data %>%
    assign_missingness(sample=Sample,
                       condition = Condition,
                       grouping = `Protein IDs`,
                       intensity = normalised_intensity_log2,
                       ref_condition = 'all')

Summary of assigning missingness rows

We can see that there are still a lot of rows with missing values in the data.

Method ludovic

ludovic_imp <- impute(
    data_missing,
    sample = Sample,
    grouping = `Protein IDs`,
    intensity_log2 = normalised_intensity_log2,
    condition = Condition,
    comparison = comparison,
    missingness = missingness,
    method = 'ludovic',
    skip_log2_transform_error = TRUE
)

The imputation is done, but some rows have too much missing data for the fxn to impute a value there. The conclusion is that we still have a lot of missing values for peptides that didn’t have an LFQ value.

Method noise

# note: method = 'noise' also expects a per-peptide noise estimate via the noise argument
noise_imp <- impute(
    data_missing,
    sample = Sample,
    grouping = `Protein IDs`,
    intensity_log2 = normalised_intensity_log2,
    condition = Condition,
    comparison = comparison,
    missingness = missingness,
    method = 'noise',
    skip_log2_transform_error = TRUE
)

Visualizations

The ludovic imputation method gives a lower skewness coefficient. Importantly, both imputation methods decrease the skewness coefficient.

Statistics

DIFFERENTIAL ABUNDANCE AND SIGNIFICANCE

In the tutorial, Differential Abundance and Significance are calculated using data prepared as above. We can run the calculations on data with missing values or on imputed data. As in the tutorial, we use the moderated t-test to calculate this statistic.

All methods:

  1. t-test - Welch test.
  2. t-test_mean_sd - Welch test on means, standard deviations and number of replicates.
  3. moderated_t-test - a moderated t-test based on the limma package.
  4. proDA - can be used to infer means across samples based on a probabilistic dropout model. This eliminates the need for data imputation since missing values are inferred from the model.

result <- ludovic_imp %>%
    calculate_diff_abundance(sample=Sample,
                             condition = Condition,
                             grouping = `Protein IDs`,
                             intensity_log2 = imputed_intensity,
                             missingness = missingness,
                             comparison = comparison,
                             filter_NA_missingness = TRUE,
                             method = 'moderated_t-test')
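For comparison, method = 't-test' corresponds conceptually to a per-protein Welch test. A base R sketch on a wide log2 matrix (one row per protein; not protti's implementation):

```r
# Welch t-test per protein (row): log2 difference (treated - control) and p-value
welch_per_protein <- function(mat, condition) {
    res <- apply(mat, 1, function(x) {
        tt <- t.test(x[condition == 'treated'], x[condition == 'control'])
        c(diff = unname(tt$estimate[1] - tt$estimate[2]), pval = tt$p.value)
    })
    t(res)  # rows = proteins, columns = diff and pval
}
```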

The result can be visualised with the volcano plot, which can also be interactive, but we can’t change it to suit our needs. Below is a manually generated volcano plot showing the significant proteins split by type of missingness, which also has an influence on the type of imputation.
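A sketch of such a manual volcano plot, assuming the calculate_diff_abundance() output columns diff and adj_pval plus a carried-over missingness column (a hypothetical helper, not the exact plot shown):

```r
library(ggplot2)

# hypothetical helper: manual volcano plot coloured by missingness type;
# assumes columns diff, adj_pval and missingness in the result data frame
plot_volcano <- function(result) {
    ggplot(result, aes(x = diff, y = -log10(adj_pval), colour = missingness)) +
        geom_point(alpha = 0.6, na.rm = TRUE) +
        geom_hline(yintercept = -log10(0.05), linetype = 'dashed') +
        labs(x = 'log2 fold change', y = '-log10(adjusted p-value)')
}
```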

# per-condition mean, sd and sample size -- note these are computed across all
# proteins within a condition, not per protein, which is likely why the
# t-test_mean_sd volcano plot below looks odd
ludovic_imp <- ludovic_imp %>%
    group_by(Condition) %>%
    mutate(mean = mean(imputed_intensity, na.rm = TRUE),
           sd = sd(imputed_intensity, na.rm = TRUE),
           n_samples = n()) %>%
    ungroup()

result <- ludovic_imp %>%
    calculate_diff_abundance(sample=Sample,
                             condition = Condition,
                             grouping = `Protein IDs`,
                             intensity_log2 = imputed_intensity,
                             missingness = missingness,
                             comparison = comparison,
                             filter_NA_missingness = TRUE,
                             method = 't-test_mean_sd',
                             mean = mean,
                             sd = sd,
                             n_samples = n_samples,
                             ref_condition = 'all')

The volcano plot looks strange for the result of the differential abundance analysis with t-test_mean_sd.

COEFFICIENT OF VARIATION

Interesting Observation

When we run the code below, we get a data frame with 5138 rows per sample.

lfq_long <- read.csv('./data/LFQ_long_raw.csv')


lfq_long %>%
    select(Sample) %>%
    filter(grepl('TOTALS', Sample) & (grepl('22', Sample) | grepl('23', Sample) | grepl('24', Sample)))%>%
    group_by(Sample) %>%
    summarise(test = length(Sample))
## # A tibble: 6 × 2
##   Sample        test
##   <chr>        <int>
## 1 KO_TOTALS_22  5138
## 2 KO_TOTALS_23  5138
## 3 KO_TOTALS_24  5138
## 4 WT_TOTALS_22  5138
## 5 WT_TOTALS_23  5138
## 6 WT_TOTALS_24  5138

But when we read the original data from proteinGroups.txt and apply the basic filtering above, we get 5905 rows.